
PR: Refine ggml-hexagon backend (Qualcomm Hexagon NPU backend) for latest ggml, whisper.cpp, llama.cpp #12326


Closed
wants to merge 149 commits

Conversation

jeffzhou2000

@jeffzhou2000 commented Mar 11, 2025

  • I have read the contributing guidelines
  • Self-reported review complexity:
    * [ ] Low
    * [x] Medium (complexity of the code on the ARM-AP side is medium; complexity of the code on the cDSP side (hexagon-kernels) is high)
    * [ ] High
  • Testing Done
    * [x] test-backend-ops and llama-cli through HWACCEL_QNN on Android phones equipped with Qualcomm Snapdragon 8 Gen3 & 8 Elite
    * [x] test-backend-ops and llama-cli through HWACCEL_CDSP on Android phones equipped with Qualcomm Snapdragon 8 Gen3 & 8 Elite
    * [x] the major features of the ggml backend subsystem through HWACCEL_CDSP (the main approach in this PR) have been verified on Android phones equipped with Qualcomm Snapdragon 8 Gen3 & 8 Elite

PR Description

This PR is a continuation of my original PR #6869 from 04/2024, focused on the final mission:

  • how to utilize the Qualcomm Hexagon NPU as fully as possible with the well-designed and highly compact ggml machine learning framework.

Following the pattern of other existing backends, this PR is the initial phase of the ggml-hexagon backend for the Qualcomm Hexagon NPU on Android phones. It is already a functional/practical MVP (Minimum Viable PR): it supports GGML_OP_ADD and GGML_OP_MUL_MAT, it passes test-backend-ops and llama-cli, and the performance of GGML_OP_ADD and GGML_OP_MUL_MAT with fp32 on the cDSP side is very positive in both cases.

The full and TL;DR description of this PR can be found at my forked llama.cpp project: jeffzhou2000#30.

The high-level data path, or so-called high-level architecture, of ggml-hexagon can be found at my forked llama.cpp project: high-level data path of ggml-hexagon.

Features

  • provide a concise reference implementation of HWACCEL_QNN in this PR: offload ggml ops to QNN.

  • provide a very fast approach (HWACCEL_CDSP), closely analogous to Intel's ggml-sycl or Qualcomm's ggml-opencl, in this PR: offload some performance-sensitive ggml ops to the Hexagon cDSP directly.

  • the Hexagon NPU performance of the HWACCEL_QNN and HWACCEL_CDSP approaches can be easily compared: this PR provides a computation visualization approach to help other developers and AI experts visualize the comparison between the cDSP approach and the QNN approach.

  • dynamic runtime parameter adjustment through ggml-hexagon.cfg (this idea comes from @ngxson in his draft AI-dedicated PR; more parameters can be added to this configuration file).

  • probe/detect the Snapdragon SoC information at runtime; accordingly, the code might/should run well on the following Qualcomm DSPs (see the sketch after this list):
    #v68 --- Snapdragon 888
    #v69 --- Snapdragon 8 Gen1
    #v73 --- Snapdragon 8 Gen2
    #v75 --- Snapdragon 8 Gen3 (verified)
    #v79 --- Snapdragon 8 Elite (aka 8 Gen4) (verified)

  • provide a customized tiny ggml-dsp, which is borrowed/reused/ported from the original ggml and runs well on the Hexagon cDSP side. This feature will be very helpful for domain experts or AI experts who want to do AI innovation directly with Qualcomm's amazing lightweight/low-level (C/C++ and HVX assembly, able to operate the hardware directly) Hexagon SDK on the cDSP side, rather than learning Qualcomm's heavyweight/high-level QNN SDK API on the ARM-AP side.

  • provide a big picture of the ggml-hexagon backend in this PR for further or other related dev activities in this great pure-tech community.
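
For illustration only, here is a minimal C sketch of the SoC-to-DSP-architecture table behind the probing idea above (this is not the code in this PR; the struct and helper names are hypothetical):

#include <string.h>

/* hypothetical lookup table: Snapdragon SoC -> Hexagon DSP architecture version */
struct soc_to_dsp_arch {
    const char * soc_name;
    int          dsp_arch;
};

static const struct soc_to_dsp_arch k_soc_table[] = {
    { "Snapdragon 888",             68 },
    { "Snapdragon 8 Gen1",          69 },
    { "Snapdragon 8 Gen2",          73 },
    { "Snapdragon 8 Gen3",          75 },  /* verified */
    { "Snapdragon 8 Elite (Gen4)",  79 },  /* verified */
};

/* return the DSP arch version for a detected SoC name, or -1 when unknown */
static int probe_dsp_arch(const char * soc_name) {
    for (size_t i = 0; i < sizeof(k_soc_table) / sizeof(k_soc_table[0]); i++) {
        if (0 == strcmp(k_soc_table[i].soc_name, soc_name)) {
            return k_soc_table[i].dsp_arch;
        }
    }
    return -1;
}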

How to build the ggml-hexagon source code for Android and verify the ggml-hexagon backend on a Snapdragon-based phone

Ubuntu 20.04 / 22.04 is validated and recommended as the host machine (other Linux distributions, a Linux VM, or WSL on Windows 10/11 might also work):

  • utilize build-run-android.sh to download the Android NDK and Qualcomm QNN SDK automatically; the Qualcomm Hexagon SDK must be obtained with a Qualcomm Developer Account and cannot be downloaded automatically by this script.

  • we will need an Android smartphone, connected via adb, running on one of the following Qualcomm SoCs:

    SM8450 (Snapdragon 8 Gen 1+)
    SM8550 (Snapdragon 8 Gen 2)
    SM8650 (Snapdragon 8 Gen 3)
    SM8750-AB (Snapdragon 8 Elite) (aka Snapdragon 8 Gen 4)

  git clone https://github.com/zhouwg/ggml-hexagon
  cd ggml-hexagon
  git checkout pr_to_upstream

 ./scripts/build-run-android.sh 
Usage:
  ./scripts/build-run-android.sh help
  ./scripts/build-run-android.sh print_oplist
  ./scripts/build-run-android.sh build
  ./scripts/build-run-android.sh updateqnnlib
  ./scripts/build-run-android.sh run_testops
  ./scripts/build-run-android.sh run_testop          [ADD/MUL_MAT]
  ./scripts/build-run-android.sh run_llamacli
  ./scripts/build-run-android.sh run_llamabench
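
A typical first run might look like the following (the sub-commands come from the usage text above; the comments are my own interpretation of each step, so treat them as assumptions):

  ./scripts/build-run-android.sh build          # build llama.cpp with the ggml-hexagon backend for Android
  ./scripts/build-run-android.sh updateqnnlib   # push the QNN runtime libraries to the phone
  ./scripts/build-run-android.sh run_testops    # run test-backend-ops on the phone via adb
  ./scripts/build-run-android.sh run_llamacli   # run llama-cli on the phone via adb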


We can confirm from the output of "adb logcat | grep ggml-hexagon" that this backend works as expected.

Hexagon NPU Performance

The test phones are a Snapdragon 8 Gen3 Android phone and a Snapdragon 8 Elite (aka 8 Gen4) Android phone, and the test model is qwen1_5-1_8b-chat-q4_0.gguf. The QNN SDK is v2.32.0.250228 and the Hexagon SDK is v6.2.0.1.

case-1: GGML_OP_ADD performance comparison between QNN-NPU and cDSP in real LLM inference

(performance screenshot)

case-2: GGML_OP_MUL_MAT performance comparison between QNN-NPU and cDSP(small matrix mulmat through test-backend-ops)

(performance screenshot)

[updated on 04/09/2025, 09:19] I suddenly found that QNN-NPU's performance was significantly improved after I upgraded the QNN SDK to v2.33.0.250327.

The test phones are a Snapdragon 8 Gen3 Android phone and a Snapdragon 8 Elite (aka 8 Gen4) Android phone, and the test model is qwen1_5-1_8b-chat-q4_0.gguf. The QNN SDK is v2.33.0.250327 and the Hexagon SDK is v6.2.0.1.

(performance screenshot)

(performance screenshot)

The details and how to reproduce the above results can be found at my forked llama.cpp project: jeffzhou2000#28.

Big picture of ggml-hexagon backend

There are three tech approaches to implementing the ggml-hexagon backend for Qualcomm's Hexagon NPU:

  • general approach through the Qualcomm QNN SDK: offload ggml ops to QNN (QNN's internals then transfer them to the Hexagon cDSP)
  • general approach through the Qualcomm Hexagon SDK: offload ggml ops to the Hexagon cDSP directly, which is quite similar to Qualcomm's ggml-opencl or Intel's ggml-sycl.
  • special approach through the Qualcomm QNN SDK: map the entire ggml cgraph to a single QNN graph. The technical approach of "mapping the entire ggml computational graph to a single QNN graph" was already explored in 04/2024.
enum hwaccel_approach_type {
    HWACCEL_QNN             = 0, // C API; before 03/11/2025; not easy because the QNN SDK is a heavy black-box SDK with many, many tricks
    HWACCEL_QNN_SINGLEGRAPH = 1, // C API; before 03/18/2025; very hard because the mechanism is a black-box and the workload is massive
    HWACCEL_CDSP            = 2, // C and assembly API; after 03/24/2025; hard, but we can do anything on the cDSP directly, because the Hexagon SDK is a very lightweight/thin SDK and we can operate the hardware directly through it
    HWACCEL_SYCL            = 3, // personal proposal/assumption; a general and modern C++ API; N/A at the moment because an essential adaptation layer would have to be provided by Qualcomm
};

The tech details of the "special approach through QNN" can be found at my forked llama.cpp project: jeffzhou2000#24.
10+ reasons why I think HWACCEL_CDSP is the correct direction can be found at my forked llama.cpp project: jeffzhou2000#28.

Acknowledgement

  1. The implementation of HWACCEL_QNN is mainly ported/reverse engineered from executorch (the implementation of the QNN backend in executorch comes from Qualcomm). The implementation of HWACCEL_CDSP borrows some code from Qualcomm's Hexagon SDK. One more important thing: I got breakthrough help from @chiwwang at Qualcomm Technologies Inc/Qualcomm Innovation Center in 04/2024. All in all, all the fundamental techs of this topic (a dedicated ggml/llama.cpp backend for Qualcomm's Hexagon NPU) come from Qualcomm.
  2. Huge thanks to the excellent maintainers & original authors of ggml & llama.cpp; I learnt so much from ggml & llama.cpp: their open-minded spirit and standout contributions are a great public good for the open-source community and our planet. One more important thing: the tiny ggml-dsp on the Hexagon cDSP side (aka the existing implementation of the hexagon kernels on the cDSP side; I'm not an AI expert and this is a practical way for me) is completely ported/borrowed from the original ggml.
  3. Huge thanks to @max-krasnyansky, a senior staff technical expert from Qualcomm headquarters, who gave important/valuable/breakthrough guidance on direction on 03/18/2025: QNN is not the right solution here.

Conclusion

After spending a lot of effort on the ggml-hexagon backend, I personally think:

  • AI experts must be involved in the remaining parts of the hexagon-kernels: AI experts only need to focus on the hexagon-kernels. AI experts and other domain tech experts around the world can help to improve the hexagon-kernels (various mulmat and norm/rmsnorm/softmax/...); they can operate the cDSP hardware directly and can do any AI innovation through the lightweight and amazing Hexagon SDK on the cDSP side.

[updated on 04/02/2025, 22:18] @ggerganov @slaren, sorry to bother you; I understand your time is valuable. Could you help change the label of this PR to "Qualcomm NPU" and remove the labels "testing", "script", and "build"? Thanks so much!

@github-actions bot added the build, script, ggml, and testing labels Mar 11, 2025
@Dampfinchen

Nice job. NPU support is huge for this project. Do you think it's also possible to make it work on the Exynos 2200 and 2400 NPUs?

@jeffzhou2000
Author

jeffzhou2000 commented Mar 12, 2025

Nice job. NPU support is huge for this project. Do you think it's also possible to make it work on the Exynos 2200 and 2400 NPUs?

thanks for your kind comment.

  1. Qualcomm's Hexagon NPU support is really huge work for this project, even though we now clearly know the principle, because Qualcomm provides dedicated binary tools for LLM model conversion in their dedicated AI software stacks, and some other closed-source implementations use exactly the same approach. So programmers must compose an ideal QNN graph from the complete ggml cgraph manually in the ggml-qnn backend if they choose the second tech approach ("mapping the complete ggml cgraph to a single QNN graph"). There are 800+ cgraph nodes and 50+ ops in qwen1_5-1_8b-chat-q4_0.gguf; accordingly, "(Hexagon) NPU support is huge for this project", and real AI experts must be involved in the remaining parts of ggml-qnn.
  2. I think I can make it (ggml-exynos or ggml-samsung) work on the Exynos 2200 if I can get a suitable phone (I can try to buy one) and the SDK & tech docs (this might not be easy because of the strict IPR policies in some big IT companies, as far as I understand at the moment), following the principle "make it run, then make it right, and finally make it fast"; this is one of my areas of expertise.

jeffzhou2000

This comment was marked as resolved.

jeffzhou2000

This comment was marked as resolved.

jeffzhou2000

This comment was marked as resolved.

jeffzhou2000

This comment was marked as resolved.

@Jianhua-Cui

Nice work!

However, will offloading MatMul to CDSP layer by layer incur a significant FastRPC overhead?

As far as I know, with zero-copy enabled, the overhead of a single FastRPC call is around 1–3ms (on the 8750 device).

Could you please share the end-to-end performance results?

@jeffzhou2000
Author

jeffzhou2000 commented Mar 25, 2025

Nice work!

However, will offloading MatMul to CDSP layer by layer incur a significant FastRPC overhead?

1. This can be avoided with the same mechanism the QNN SDK uses: a shared buffer or memory pool between the AP and the cDSP through ION memory or DMA memory; see the sketch below.
2. A senior staff tech expert from Qualcomm's headquarters has told me: QNN is not the right solution here.
3. Various local experiments have confirmed that QNN-CPU's & QNN-NPU's performance is really worse than the default ggml CPU backend, and also really worse than the general approach through the cDSP directly.
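
For reference, a minimal sketch of that shared-buffer idea with the Hexagon SDK's rpcmem API (an illustration of the mechanism rather than the code in this PR; the constants and signatures follow the SDK's rpcmem.h, so double-check them against your SDK version):

#include "rpcmem.h"   /* AP-side FastRPC memory helpers from the Hexagon SDK */

/* illustrative helper: allocate one ION-backed pool that the cDSP can map,
   so per-op FastRPC calls pass offsets into the pool instead of copying tensor data */
static void * alloc_shared_pool(int pool_size, int * out_fd) {
    void * pool = rpcmem_alloc(RPCMEM_HEAP_ID_SYSTEM, RPCMEM_DEFAULT_FLAGS, pool_size);
    if (pool != NULL && out_fd != NULL) {
        *out_fd = rpcmem_to_fd(pool);   /* this fd is what the cDSP side maps */
    }
    return pool;   /* release later with rpcmem_free(pool) */
}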

As far as I know, with zero-copy enabled, the overhead of a single FastRPC call is around 1–3ms (on the 8750 device).

Could you please share the end-to-end performance results?

What do you mean by "end-to-end performance results"?

  1. this link, opencl: improve profiling #12442 (comment), might be useful for your question.

  2. verify the "end-to-end performance" manually.

git clone https://github.com/kantv-ai/ggml-hexagon
cd ggml-hexagon
git checkout pr_to_upstream
./scripts/build-run-android.sh build
./scripts/build-run-android.sh run_llamacli

modify inference_approach manually to verify the approach through QNN or the cDSP:

zhouwg:$ more scripts/ggml-qnn.cfg 
[general]
#0: QNN-CPU backend
#1: QNN-GPU backend
#2: QNN-NPU backend
#3: default ggml backend
qnn_backend = 2

# enable/disable QNN's internal log
print_qnn_internal_log = 0

# enable/disable perf of op function
enable_perf = 1

# enable/disable print tensors info in op function
print_tensors_info = 0

# enable/disable dump op info in handle_op
dump_op_info = 0

# 0: general approach through QNN
# 1: general approach through Hexagon cDSP
# 2: special approach through QNN: mapping entire ggml cgraph to QNN graph(beyond scope of this PR)
inference_approach = 1

[npu]
hvx_threads = 4
vtcm_size_in_mb = 8
enable_dlbc = 1
precision_mode = "fp16"

@Jianhua-Cui

Jianhua-Cui commented Mar 25, 2025

Thanks for your feedback!

Nice work!
However, will offloading MatMul to CDSP layer by layer incur a significant FastRPC overhead?

1. This can be avoided with the same mechanism the QNN SDK uses: a shared buffer or memory pool between the AP and the cDSP through ION memory or DMA memory 2. A senior staff tech expert from Qualcomm's headquarters has told me: QNN is not the right solution here 3. Various local experiments have confirmed that QNN-CPU's & QNN-NPU's performance is really worse than the default ggml CPU backend.

I understand that using techniques like a shared buffer will reduce the FastRPC overhead (as I mentioned, "with zero-copy enabled"). However, LLM inference requires offloading hundreds of layers. Even with a shared buffer, this overhead remains significant. So what I'd like to ask is whether you have any other optimization strategies to address this issue.

I see that you have actually analyzed this and found it infeasible to transfer the entire graph to QNN. However, if offloading is done on a per-layer basis, it seems that the overhead of these FastRPC calls can never be optimized away.

As far as I know, with zero-copy enabled, the overhead of a single FastRPC call is around 1–3ms (on the 8750 device).
Could you please share the end-to-end performance results?

What do you mean by "end-to-end performance results"?

I'm sorry for the confusing question. End-to-end performance results means the prefill & decode speed (tokens/s). I see your comments mention some results, but those results seem incomplete.

git clone https://github.com/kantv-ai/ggml-hexagon
cd ggml-hexagon
git checkout pr_to_upstream
./scripts/build-run-android.sh build
./scripts/build-run-android.sh run_llamacli

modify inference_approach manually to verify the approach through QNN or the cDSP:

zhouwg:$ more scripts/ggml-qnn.cfg 
[general]
#0: QNN-CPU backend
#1: QNN-GPU backend
#2: QNN-NPU backend
#3: default ggml backend
qnn_backend = 2

# enable/disable QNN's internal log
print_qnn_internal_log = 0

# enable/disable perf of op function
enable_perf = 1

# enable/disable print tensors info in op function
print_tensors_info = 0

# enable/disable dump op info in handle_op
dump_op_info = 0

# 0: general approach through QNN
# 1: general approach through Hexagon cDSP
# 2: special approach through QNN: mapping entire ggml cgraph to QNN graph(beyond scope of this PR)
inference_approach = 1

[npu]
hvx_threads = 4
vtcm_size_in_mb = 8
enable_dlbc = 1
precision_mode = "fp16"

Thank you for providing these running commands, I'll try it.

Additionally, I would like to offer an extra opinion on why the performance of Qualcomm's official AI Hub is good enough. I think it's because they convert the entire graph to QNN, which gives them three exclusive advantages in hardware utilization:

  1. The overhead of fastrpc is small;
  2. They can take advantage of HMX instructions and VTCM high-speed memory;
  3. Comprehensive graph optimization.

@Jianhua-Cui

By the way, I took a quick look at the operator code you implemented on cdsp and it seems that you haven't used HVX intrinsics yet. This might be one of the reasons for the suboptimal performance.

I suggest you take a look at the linear algebra library provided by Qualcomm for the cDSP, although its performance is not great either.
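
For illustration, here is what an HVX-intrinsics flavor of an fp32 element-wise add could look like (this is not code from this PR; the qf32 intrinsics assume HVX v68 or newer, and the header names follow common Hexagon SDK conventions, so verify them against your toolchain):

#include "hexagon_types.h"
#include "hexagon_protos.h"

/* add n fp32 elements with 128-byte HVX vectors; assumes n is a multiple of 32
   and that src0/src1/dst are 128-byte aligned */
static void hvx_add_f32(const float * src0, const float * src1, float * dst, int n) {
    const HVX_Vector * va = (const HVX_Vector *) src0;
    const HVX_Vector * vb = (const HVX_Vector *) src1;
    HVX_Vector       * vc = (HVX_Vector *) dst;
    for (int i = 0; i < n / 32; i++) {                          /* 32 fp32 lanes per vector */
        HVX_Vector sum = Q6_Vqf32_vadd_VsfVsf(va[i], vb[i]);    /* sf + sf -> qf32 */
        vc[i] = Q6_Vsf_equals_Vqf32(sum);                       /* qf32 -> sf */
    }
}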

@jeffzhou2000
Author

jeffzhou2000 commented Mar 25, 2025

Additionally, I would like to offer an extra opinion on why the performance of Qualcomm's official AI Hub is good enough. I think it's because they convert the entire graph to QNN, which gives them three exclusive advantages in hardware utilization:

1. The overhead of fastrpc is small;

The FastRPC overhead should be the same across the various tech approaches; I personally think the overhead of going through the cDSP directly might be the minimum:

datapath through QNN:

user code(ggml-hexagon backend)  <------> QNN API <------> QNN SDK <------> FastRPC framework(user-space lib and kernel driver) in HLOS(here is Android OS) <------> embedded OS on cDSP <------> FastRPC framework on cDSP <------> Hexagon nn libs on cDSP

datapath through cDSP directly:

user code(ggml-hexagon backend, similar to  TEE CA) <------> FastRPC framework(user-space lib and kernel driver) in HLOS(here is Android OS) <------> embedded OS on cDSP <------> FastRPC framework on cDSP <------> user code on cDSP(hexagon kernels, similar to TEE TA, or opencl kernels in ggml-opencl, or cuda kernels in other backends)

I think that's why the senior staff tech expert from Qualcomm headquarters said "QNN is not the right solution here"; in fact, the NPU performance through QNN is really bad here (ggml / llama.cpp).

Because we can't utilize the dedicated binary tools provided by Qualcomm here, and we (the llama.cpp community) can't re-create Qualcomm's entire dedicated AI stack in ggml/llama.cpp (mapping all ggml ops and the entire ggml cgraph to a single QNN graph would not be easy even for Qualcomm's world-class engineering team; I would prefer to do some adaptation work with Intel's SYCL stack if I were a regular employee of Qualcomm's AI team, as that is a more practical and desirable direction), offloading some performance-sensitive ggml ops to the cDSP directly is a practical way/direction: we only need to focus on the hexagon kernels through various well-designed algorithms or HVX instructions on the cDSP. This is also the key reason why I think this PR should be approved (we don't need to do more complex things in ggml-qnn.cpp from now on).

At the same time, we can clearly see that the so-called FastRPC mechanism/framework is quite similar to the mechanism in a TEE.

2. They can take advantage of HMX instructions and VTCM high-speed memory;

Because the QNN SDK's internals indirectly call Qualcomm's Hexagon nn libs on the cDSP, which might/should be highly optimized with HVX SIMD instructions (they have plenty of excellent software engineers and AI experts).

3. Comprehensive graph optimization.

I agree with your opinion: QNN's internals do some things with the single QNN graph you pointed out. Please refer to the section "Big picture of ggml-hexagon backend" in this PR, or refer to jeffzhou2000#24.

@myan-o

myan-o commented Mar 28, 2025

Can the Hexagon SDK not be used in a termux environment?
It doesn't seem to be provided in a zip format.
I'm in trouble because I only have an Android device.

@jeffzhou2000
Author

jeffzhou2000 commented Mar 29, 2025

Can the Hexagon SDK not be used in a termux environment?

The Hexagon SDK needs to be used in a standard Linux environment (Ubuntu 20.04/22.04 is recommended) to build the entire ggml-hexagon source code in this PR; please refer to the section "How to build the ggml-hexagon source code for Android and verify the ggml-hexagon backend on a Snapdragon-based phone" in the PR description.

Btw, I'm not familiar with the termux environment, but I think it's a limited Linux running environment on Android devices. In fact, we can verify/validate this PR through the self-made script build-run-android.sh on a standard Linux OS (or through a Linux VM on Windows 10/11, or through WSL2 on Windows 10/11) easily and directly, so termux might not be needed or suitable here (I'm not sure about this because I didn't verify this opinion).

It doesn't seem to be provided in a zip format. I'm in trouble because I only have an Android device.

Currently, a Qualcomm account is needed to get the Hexagon SDK to build the entire ggml-hexagon source code; this is the key reason why build-run-android.sh can't download the Hexagon SDK automatically (the self-made script build-run-android.sh can download the Android NDK and QNN SDK automatically to make the workflow easier).

At the same time, I cannot share my local self-made zip of the Hexagon SDK publicly, or distribute the Hexagon SDK in another way, because the Hexagon SDK must be obtained with a Qualcomm Developer Account and my self-made local zip of the Hexagon SDK contains/binds my personal Qualcomm account & license information.

This IPR policy makes sense (my personal opinion) because Qualcomm has already made its QNN SDK publicly and freely available without restriction (we developers and programmers must give thanks to @slaren), and it is clear that the Hexagon SDK is more valuable: although the Hexagon SDK is also free to developers, Qualcomm needs to know how many developers use their unique and valuable Hexagon SDK. Of course, it would make developers' workflow easier if Qualcomm could also publish an unrestricted Hexagon SDK, because I also think "more people/companies using the world-class Snapdragon mobile/vehicle/desktop SoC is another key point".

@myan-o

myan-o commented Mar 29, 2025

@zhouwg

Thank you for your answer. It seems that the only way is to unzip it using Google Colab.

jeffzhou2000

This comment was marked as resolved.

@ggerganov added the Qualcomm NPU label and removed the build, script, and testing labels Apr 2, 2025
@github-actions bot added the build and script labels Apr 2, 2025
@myan-o

myan-o commented Apr 8, 2025

@zhuwg

I would like to specify the termux lib path, but it is fixed and I would like to change it.

ggml-hexagon.cpp

        .runtimelib_path        = "/data/local/tmp/",

QNN_DEFAULT_LIB_SEARCH_PATH

@jeffzhou2000
Author

jeffzhou2000 commented Apr 8, 2025

@zhuwg

I would like to specify the termux lib path, but it is fixed and I would like to change it.

ggml-hexagon.cpp

        .runtimelib_path        = "/data/local/tmp/",

QNN_DEFAULT_LIB_SEARCH_PATH

thanks for your comment,
maybe we can add a runtime configuration item in scripts/ggml-hexagon.cfg, so users can dynamically adjust the runtime lib path accordingly (see the illustrative snippet below).

what's your suggestion for this problem? thanks.
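
For example, a possible new entry in scripts/ggml-hexagon.cfg could look like this (the key name runtimelib_path is hypothetical and not implemented in this PR yet):

[general]
# hypothetical item: override the default runtime lib path ("/data/local/tmp/"),
# e.g. so termux users could point it at their own lib directory
runtimelib_path = "/data/data/com.termux/files/usr/lib/"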

@myan-o

myan-o commented Apr 8, 2025

@zhuwg

I would like to specify the termux lib path, but it is fixed and I would like to change it.

ggml-hexagon.cpp

        .runtimelib_path        = "/data/local/tmp/",

QNN_DEFAULT_LIB_SEARCH_PATH

thanks for your comment,
maybe we can add a runtime configuration item in scripts/ggml-hexagon.cfg.

what's your suggestion for this problem? thanks.

Is it not enough to just replace it with QNN_DEFAULT_LIB_SEARCH_PATH?

@myan-o

myan-o commented Apr 8, 2025

When I copied what I built on Google Colab to Termux and ran it, I got the following error.

device: GT5 pro(snapdragon gen3)

[bin]$ ls $PREFIX/lib/libQnn*
libQnnCpu.so
libQnnGpu.so
libQnnHtp.so
libQnnHtpNetRunExtensions.so
libQnnHtpOptraceProfilingReader.so
libQnnHtpPrepare.so
libQnnHtpProfilingReader.so
libQnnHtpV68CalculatorStub.so
libQnnHtpV68Skel.so
libQnnHtpV68Stub.so
libQnnHtpV69CalculatorStub.so
libQnnHtpV69Skel.so
libQnnHtpV69Stub.so
libQnnHtpV73CalculatorStub.so
libQnnHtpV73Skel.so
libQnnHtpV73Stub.so
libQnnHtpV75CalculatorStub.so
libQnnHtpV75Skel.so
libQnnHtpV75Stub.so
libQnnHtpV79CalculatorStub.so
libQnnHtpV79Skel.so
libQnnHtpV79Stub.so
libQnnSystem.so
[bin]$ LD_LIBRARY_PATH=".:/vendor/lib64" ./llama-server
 
[ggmlhexagon_load_cfg, 1756]: load hexagon appcfg from /data/data/com.termux/files/usr/lib/ggml-hexagon.cfg
[operator(), 1762]: section[cdsp      ],[enable_rpc_dma_mempool   ] = [0]
[operator(), 1762]: section[cdsp      ],[enable_rpc_ion_mempool   ] = [0]
[operator(), 1762]: section[qnn       ],[precision_mode           ] = ["fp16"]
[operator(), 1762]: section[qnn       ],[enable_dlbc              ] = [1]
[operator(), 1762]: section[qnn       ],[vtcm_size_in_mb          ] = [8]
[operator(), 1762]: section[qnn       ],[hvx_threads              ] = [4]
[operator(), 1762]: section[general   ],[hwaccel_approach         ] = [2]
[operator(), 1762]: section[general   ],[print_qnn_internal_log   ] = [0]
[operator(), 1762]: section[general   ],[enable_q_mulmat          ] = [0]
[operator(), 1762]: section[general   ],[print_tensors_info       ] = [0]
[operator(), 1762]: section[general   ],[hexagon_backend          ] = [2]
[operator(), 1762]: section[general   ],[dump_op_info             ] = [0]
[operator(), 1762]: section[general   ],[enable_perf              ] = [1]
[operator(), 1762]: section[general   ],[version                  ] = ["1.00"]
[ggmlhexagon_load_cfg, 1786]: internal ggml_hexagon_version=1.80
[ggmlhexagon_load_cfg, 1787]: internal ggml_dsp_version=0.60
[ggmlhexagon_load_cfg, 1788]: external ggml_hexagon_version=1.00
[ggmlhexagon_load_cfg, 1790]: hwaccel_approach=2(HWACCEL_CDSP)
[ggmlhexagon_load_cfg, 1792]: hexagon_backend=2(HEXAGON_BACKEND_CDSP)
[ggmlhexagon_load_cfg, 1793]: runtime libpath=/data/data/com.termux/files/usr/lib/
[ggmlhexagon_load_cfg, 1794]: enable_perf=1
[ggmlhexagon_load_cfg, 1795]: enable_profiler=0
[ggmlhexagon_init_dsp, 5209]: init Hexagon cDSP with backend 2(HEXAGON_BACKEND_CDSP)
[ggmlhexagon_init_dsp, 5280]: using Hexagon domain 3(Hexagon-cDSP)
[ggmlhexagon_init_dsp, 5281]: unsignedpd_enabled 1
[ggmlhexagon_init_dsp, 5327]: error 0x80000406: failed to open domain 3(Hexagon-cDSP)
[ggmlhexagon_deinit_cdsp, 5172]: enter ggmlhexagon_deinit_cdsp
[ggmlhexagon_deinit_cdsp, 5185]: leave ggmlhexagon_deinit_cdsp
[ggml_backend_hexagon_reg, 6376]: init hexagon dsp failure
/root/.builder/source/bin/llama-cpp-hexagon-branch-pr_to_upstream/ggml/src/ggml-hexagon/ggml-hexagon.cpp:6378: GGML_ASSERT(0 == result) failed

@jeffzhou2000
Author

@zhouwg
I would like to specify the termux lib path, but it is fixed and I would like to change it.
ggml-hexagon.cpp

        .runtimelib_path        = "/data/local/tmp/",

QNN_DEFAULT_LIB_SEARCH_PATH

thanks for your comment,
maybe we can add a runtime configuration item in scripts/ggml-hexagon.cfg.
what's your suggestion for this problem? thanks.

Is it not enough to just replace it with QNN_DEFAULT_LIB_SEARCH_PATH?

An env var might be a good idea, although it seems equivalent to a new runtime configuration item in scripts/ggml-hexagon.cfg. There are many runtime configuration items at the moment, so a uniform approach might be better for developers and users.
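
A minimal sketch of the env-var idea (illustrative only; the variable name GGMLHEXAGON_LIB_PATH and the fallback behavior are assumptions, not part of this PR):

#include <stdlib.h>

/* prefer an environment variable when it is set, otherwise keep the value from ggml-hexagon.cfg */
static const char * get_runtime_libpath(const char * cfg_value) {
    const char * env = getenv("GGMLHEXAGON_LIB_PATH");   /* hypothetical env var */
    return (env != NULL && env[0] != '\0') ? env : cfg_value;
}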

zhouwg added 29 commits July 10, 2025 23:01
…accel_approach) in ggml-hexagon.h for further usage
@ggerganov closed this Jul 14, 2025
Labels
build (Compilation issues), ggml (changes relating to the ggml tensor library for machine learning), Qualcomm NPU, script (Script related)

9 participants